library(tidyverse) # provides %>%, the dplyr verbs and forcats::as_factor() used below

rawdata <- readr::read_csv("DATA2X02 class survey 2020 (Responses) - Form responses 1.csv")
data <- rawdata %>%
  janitor::clean_names() %>%
  mutate(
    timestamp = lubridate::dmy_hms(timestamp)
  ) %>%
  rename(
    covid = how_many_times_have_you_been_tested_for_covid,
    postcode = postcode_of_where_you_live_during_semester,
    dentist = how_long_has_it_been_since_you_last_went_to_the_dentist,
    uni_work = on_average_how_many_hours_per_week_did_you_spend_on_university_work_last_semester,
    social_media = what_is_your_favourite_social_media_platform,
    dog_cat = did_you_have_a_dog_or_a_cat_when_you_were_a_child,
    parents = do_you_currently_live_with_your_parents,
    exercise = how_many_hours_a_week_do_you_spend_exercising,
    eye_colour = what_is_your_eye_colour,
    asthma = do_you_have_asthma,
    paid_work = on_average_how_many_hours_per_week_did_you_work_in_paid_employment_in_semester_1,
    season = what_is_your_favourite_season_of_the_year,
    shoe = what_is_your_shoe_size,
    height = how_tall_are_you,
    floss = how_often_do_you_floss_your_teeth,
    glasses = do_you_wear_glasses_or_contacts,
    hand = what_is_your_dominant_hand,
    steak = how_do_you_like_your_steak_cooked,
    stress = on_a_scale_from_0_to_10_please_indicate_how_stressed_you_have_felt_in_the_past_week
  ) %>%
  mutate(
    gender = case_when(
      grepl("f", tolower(gender), fixed=T) ~ "Female",
      grepl("m", tolower(gender), fixed=T) ~ "Male",
      TRUE ~ "Non-binary"
    ) %>% as.factor(),
    dentist = factor(dentist, levels = c("Less than 6 months", "Between 6 and 12 months", 
                                             "Between 12 months and 2 years", "More than 2 years", NA),
                         ordered = T) %>% 
            recode(
                `Less than 6 months` = "< 6 months",
                `Between 6 and 12 months` = "6 - 12 months",
                `Between 12 months and 2 years` = "12 months - 2 years", 
                `More than 2 years` = "> 2 years"
            ),
    dog_cat = as_factor(dog_cat),
    parents = as_factor(parents),
    postcode = as.factor(postcode),
    asthma = as_factor(asthma),
    season = factor(season, levels = c("Summer", "Autumn", "Winter", "Spring"), ordered = T),
    floss = factor(floss, levels = c("Less than once a week", "Weekly", "Most days", 
                                        "Every day", NA), ordered = T),
    glasses = as_factor(glasses),
    hand = case_when(
      hand == "Right handed" ~ "Right",
      hand == "Left handed" ~ "Left",
      TRUE ~ "Ambidextrous"
    ) %>% as_factor(),
    steak = factor(steak, levels = c(
      "Rare", "Medium-rare", "Medium", "Medium-well done", "Well done",
      "I don't eat beef"
    ), ordered = T)
  )

1 Experiment design

1.1 Bias

  • Selection bias / Sampling bias: the sample does not accurately represent the population. (e.g. As the class survey is published on Ed, students who spend more time checking Ed posts are more likely to complete the survey.)

  • Non-response bias: certain groups are under-represented because they elect not to participate

  • Measurement or design bias: features of the sampling method or measurement instrument influence the data obtained.

1.2 Controlled experiments vs Observational studies

Randomized controlled double-blind study:

  • The investigator randomly allocates subjects to a treatment group and a control group. The control group is given a placebo, and neither the subjects nor the investigators know which group each subject is in.

  • This design is good because we expect the two groups to be similar, so any difference in the responses is likely to be caused by the treatment.

controlled vs observational:

  • A good randomized controlled experiment can establish causation

  • An observational study can only establish association. It may suggest causation but can’t prove causation.

1.3 Confounding

Confounding occurs when the treatment group and control group differ with respect to some third variable (other than the treatment) that influences the response being studied.

  • If some subjects drop out, selection bias or survivor bias can result.
  • If not all subjects keep taking the treatment or placebo, confounding between adherers and non-adherers occurs.

Controlling for confounding: make groups more comparable by dividing them into subgroups with respect to the confounders. (e.g. if alcohol consumption is a potential confounding factor, then divide subjects into heavy drinkers, medium drinkers and light drinkers)

  • Limitations of controlling:
    • It is limited by our ability to identify all confounders and then divide the study by those confounders.
    • This explains why it took so long to establish that smoking causes lung cancer: researchers needed to control for factors such as health, diet, lifestyle, environment, etc.

1.4 Simpson’s Paradox

  • A clear trend in individual groups of data disappears when the groups are pooled together.
  • It occurs when relationships between percentages in subgroups are reversed when the subgroups are combined, because of a confounding or lurking variable.

2 Data cleaning


Figure 2.1: Visualising the missingness in the data.
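The per-column missingness behind a plot like Figure 2.1 can be computed in base R; a minimal sketch with a hypothetical stand-in data frame (the real analysis would use the cleaned `data` from Section 1):

```r
# stand-in data frame with a few missing values (hypothetical)
df <- data.frame(
  height = c(170, NA, 165, 180),
  shoe   = c(9, 8, NA, 10),
  stress = c(5, 7, 6, NA)
)

# proportion of missing values in each column
miss_prop <- colMeans(is.na(df))
round(miss_prop, 2)
```

Packages such as visdat (`vis_miss()`) or naniar draw the same information as a cell-level heatmap, which is likely how the figure was produced.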

3 Performance evaluation

  • \(D_+\): the event that an individual has a particular disease
  • \(D_-\): the event that an individual does not have a particular disease
  • \(S_+\): represent a positive screening test result.
  • \(S_-\): represent a negative screening test result.

False negative rate is 0.12; false positive rate is 0.35; sensitivity/recall is 0.88 (= 1 − false negative rate); specificity is 0.65 (= 1 − false positive rate); precision/positive predictive value is 0.9; negative predictive value is 0.58; accuracy is 0.83.
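These quantities all come from a 2 × 2 confusion table; a minimal sketch with hypothetical counts (not the survey numbers above), illustrating the identities sensitivity = 1 − FNR and specificity = 1 − FPR:

```r
# hypothetical screening counts
tp <- 90; fn <- 10   # diseased subjects: S+ and S-
fp <- 35; tn <- 65   # healthy subjects:  S+ and S-

sensitivity <- tp / (tp + fn)               # P(S+ | D+) = 1 - false negative rate
specificity <- tn / (tn + fp)               # P(S- | D-) = 1 - false positive rate
ppv <- tp / (tp + fp)                       # precision / positive predictive value
npv <- tn / (tn + fn)                       # negative predictive value
accuracy <- (tp + tn) / (tp + fn + fp + tn)
```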

4 Measure of risk

Prospective / cohort study: subjects are initially identified as disease-free and classified by presence or absence of a risk factor.

  • one sample from the risk factor group \(R^+\)
  • another sample from the non-risk factor group \(R^-\)

Retrospective / case control study: take random samples from each of the two outcome categories which are followed back to determine the presence or absence of the risk factor.

  • one sample from the disease group \(D^+\)
  • another sample from the non-disease group \(D^-\)

Relative risk: The relative risk is the ratio of the probability of having the disease in the group with the risk factor to the probability of having the disease in the group without the risk factor. \(RR=\frac{P(D^+|R^+)}{P(D^+|R^-)}\)

  • For Prospective study only.
  • Not for Retrospective study because the proportions of cases with \(D^+\) and \(D^-\) were decided by the investigator, which means we cannot estimate \(P(D^+|R^+)\) and \(P(D^+|R^-)\).
  • RR = 1: there is no difference between the two groups; RR < 1: the disease is less likely to occur in the group with the risk factor; RR > 1: the disease is more likely to occur in the group with the risk factor.

Odds ratio:

  • for both Prospective and Retrospective studies
  • OR = 1 if and only if risk factor and disease are independent; OR > 1: the disease is more likely to occur in the group with the risk factor; OR < 1: the disease is less likely to occur in the group with the risk factor
  • standard error of \(\log(\hat{OR})\): \(\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\)
  • confidence interval (at 95%): \(CI = \exp\left(\log(\hat{OR})\pm 1.96\times\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\right)\); if the CI includes 1, the risk factor and disease are plausibly independent.
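A sketch of the odds ratio and its confidence interval for a hypothetical 2 × 2 table (the counts a, b, c, d below are made up):

```r
# hypothetical 2x2 table:          D+  D-
a <- 30;  b <- 70   # R+ group:    a   b
cc <- 15; d <- 85   # R- group:    c   d   (named cc to avoid masking c())

or_hat <- (a * d) / (b * cc)                       # sample odds ratio
se_log <- sqrt(1/a + 1/b + 1/cc + 1/d)             # SE of log(OR)
ci <- exp(log(or_hat) + c(-1, 1) * 1.96 * se_log)  # 95% CI on the OR scale
ci  # if the interval excludes 1, the data suggest dependence
```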

5 Random variables

Expectations of random variable:

  • \(E(X)=\mu\) - \(\mu\) is the population mean
  • \(Var(X)=\sigma^2\) - \(\sigma^2\) is the population variance
  • \(SD(X)=\sigma\)
  • \(E(cX)=cE(X)\)
  • \(Var(cX)=c^2Var(X)\)

For \(T=\sum^n_{i=1}X_i\):

  • \(E(T)=E(X_1+...+X_n)=E(X_1)+...+E(X_n)=\mu+...+\mu=n\mu\)
  • \(Var(T)=Var(X_1+...+X_n)=Var(X_1)+...+Var(X_n)=\sigma^2+...+\sigma^2=n\sigma^2\)

Sampling:

  • sample mean: \(E(\bar X)=E(\frac{1}{n}T)=\frac{1}{n}E(T)=\frac{1}{n}n\mu=\mu\).
  • sample variance: \(Var(\bar X)=Var(\frac{1}{n}T)=(\frac{1}{n})^2 Var(T)=\frac{1}{n^2}n\sigma^2=\frac{\sigma^2}{n}\)
  • standard error: \(SE=SD(\bar X)=\sqrt{Var(\bar X)}=\frac{\sigma}{\sqrt n}\)
  • estimated SE: \(\hat{SE}=\frac{s}{\sqrt n}\), where \(s^2=\frac{1}{n-1}\sum^n_{i=1}(x_i-\bar x)^2\)

Importance of SE: it tells us the likely size of the estimation error so that we know how accurate or reliable the estimate is.

6 Critical value & Confidence intervals

Two-sided discrepancies of interest:

  • t-test: \(|\bar x-\mu_0|>c\frac{s}{\sqrt n}\)
  • Confidence interval: \(\bar x \pm c\frac{s}{\sqrt n}\) - the set of plausible values for unknown \(\mu\).

False alarm rate / significance level: the probability that a true value \(\mu_0\) is rejected incorrectly.

Normal population: use the t-distribution: under the special statistical model where the data are modeled as values taken by iid normal random variables, if the true population mean is indeed \(\mu_0\), then the ratio \(\frac{\bar X-\mu_0}{S/ \sqrt n}\)~\(t_{n-1}\). Thus we choose c such that \(P(|\bar X-\mu_0|>c\frac{S}{\sqrt n})=P(\frac{|\bar X-\mu_0|}{S/\sqrt n}>c)=P(t_{n-1}>c)=\alpha\).

6.1 Finding quantiles in R
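For example (the degrees of freedom below match the survey examples used in this section):

```r
qt(0.975, df = 137)   # two-sided t critical value at alpha = 0.05
qt(0.05,  df = 137)   # lower 5% t quantile, for a one-sided test
qnorm(0.975)          # standard normal critical value, approx 1.96
qchisq(0.95, df = 2)  # chi-squared critical value with 2 df
```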

6.2 Calculate CI by t.test()
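The output below comes from t.test(); a self-contained sketch of the same two calls, using simulated stand-in data in place of data$uni_work (mu = 130 and the confidence levels are taken from the output, not recomputed from the survey):

```r
set.seed(1)
uni_work <- rnorm(138, mean = 28, sd = 16)  # stand-in for data$uni_work

# two-sided test of H0: mu = 130, reporting a 99% confidence interval
t.test(uni_work, mu = 130, conf.level = 0.99)

# one-sided alternative H1: mu < 130; R then reports a one-sided interval
t.test(uni_work, mu = 130, alternative = "less", conf.level = 0.995)
```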

## 
##  One Sample t-test
## 
## data:  data$uni_work
## t = -75.688, df = 137, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 130
## 99 percent confidence interval:
##  24.83204 31.84912
## sample estimates:
## mean of x 
##  28.34058
## 
##  One Sample t-test
## 
## data:  data$uni_work
## t = -75.688, df = 137, p-value < 2.2e-16
## alternative hypothesis: true mean is less than 130
## 99.5 percent confidence interval:
##      -Inf 31.84912
## sample estimates:
## mean of x 
##  28.34058

6.3 Critical value decision rule

For a test of \(H_0:\mu=\mu_0\) vs \(H_1:\mu>\mu_0\):

  • for \(H_1:\mu>\mu_0\), reject \(H_0\) if \(t_0\geq t_{n-1}(1-\alpha)\)
  • for \(H_1:\mu<\mu_0\), reject \(H_0\) if \(t_0\leq t_{n-1}(\alpha)\)
  • for \(H_1:\mu\neq\mu_0\), reject \(H_0\) if \(|t_0|\geq |t_{n-1}(\alpha/2)|\); don’t reject \(H_0\) if \(|t_0|< |t_{n-1}(\alpha/2)|\)

6.4 Hypothesis test using rejection region

Hypothesis: \(H_0:\mu=130\) vs \(H_1:\mu<130\)

Assumptions: \(X_i\) are independently and identically distributed and follow \(N(\mu,\sigma^2)\)

Test statistic: \(T=\frac{\bar X-\mu_0}{S/ \sqrt n}\). Under \(H_0\), the test statistic follows a t distribution with \(n-1=137\) degrees of freedom.

Rejection region:

\(\frac{\bar X-\mu_0}{S/\sqrt n}<t_{n-1}(0.05)\)

\(\bar X<\mu_0+t_{n-1}(0.05)S/\sqrt n\)

\(\bar X<130+(-1.66)\times 15.78/ \sqrt{138}\)

\(\bar X<127.78\)

Decision:

  • If the observed sample mean were greater than 127.78, we would not reject \(H_0\).
  • The observed sample mean, \(\bar x=28.34\), is smaller than 127.78, so we reject \(H_0\).

8 Goodness of fit test

8.0.1 Are the data obtained in line with the claim?

Hypothesis: \(H_0:p_1=p_{10},p_2=p_{20},...,p_k=p_{k0}\) vs \(H_1:\) at least one equality does not hold.

Assumptions:

  • independent observations
  • expected counts are all greater than 5 (i.e. \(e_i=np_{i0}\geq 5\))

## [1] TRUE TRUE TRUE

Test statistic: \(T = \sum_{i=1}^k\frac{(Y_{i} - e_{i})^2}{e_{i}}\).

Observed test statistic: \(t_0 = \sum_{i=1}^k\frac{(y_{i} - e_{i})^2}{e_{i}}\) = 0.1
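The output below is what chisq.test() produces when given observed counts and null probabilities; a self-contained sketch with placeholder values (not the actual survey counts):

```r
y_i <- c(35, 32, 33)       # observed counts (placeholder values)
p_i <- rep(1/3, 3)         # hypothesised proportions p_{i0}
chisq.test(y_i, p = p_i)   # computes T = sum (y_i - e_i)^2 / e_i with df = k - 1
```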

## 
##  Chi-squared test for given probabilities
## 
## data:  y_i
## X-squared = 0.096342, df = 2, p-value = 0.953

P-value: \(P(T\geq t_0)=P(\chi^2_{k-1-q}\geq 0.1)\) = 0.953 (here \(q=0\), so the degrees of freedom are \(k-1=2\))

Decision:

  • Since the p-value < 0.05, there is strong evidence in the data against \(H_0\).
  • Since the p-value > 0.05, the null hypothesis is not rejected.

8.0.2 Whether the data follows a Poisson distribution

For a Poisson distribution, \(X \sim \text{Poisson}(\lambda) \Longrightarrow E[X] = \lambda\), so \(\lambda\) is estimated by the sample mean: \[ \hat{\lambda} = \bar{x} = \frac{1}{n}\sum_{k=1}^n x_k \]

Figure 8.1: Data with overlaid Poisson distribution

Table 8.1: Data and expected counts

Number of Tests   Observed Count   Expected Counts from Poisson
0                            101                          80.72
1                             22                          43.29
2                              8                          11.61
3                              2                           2.07
4                              1                           0.28
5                              2                           0.03
6                              1                           0.00
10                             1                           0.00

Hypothesis: \(H_0:p_1=p_{10},p_2=p_{20},...,p_k=p_{k0}\) vs \(H_1:\) at least one equality does not hold.

Assumptions:

  • independent observations
  • expected counts are all greater than 5 (i.e. \(e_i=np_{i0}\geq 5\))

## [1] 0.5362319
## [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

Test statistic: \(T = \sum^k_{i=1}\frac{(Y_i-np_i)^2}{np_i}\). Under \(H_0\), the degrees of freedom are \(k-1-q\) = 1, where k is the number of groups (after pooling sparse cells) and q is the number of parameters that need to be estimated from the data.

Observed test statistic: With the observed frequencies \(y_i\) from the data and estimated parameter \(\lambda\) = 0.5362, \(t_0\) = 15.63

P-value: \(P(T\geq t_0)=P(\chi^2_{1}\geq 15.63) \approx 10^{-4}\)

Decision

  • Since the p-value is greater than 0.05, we do not reject the null hypothesis: the data are consistent with a Poisson distribution.

  • Since the p-value is smaller than 0.05, we reject the null hypothesis: the data do not follow a Poisson distribution.
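The whole procedure can be reconstructed from the counts in Table 8.1, pooling the sparse cells into a "2 or more" category (one reasonable way to address the FALSE entries in the assumption check); this sketch reproduces \(\hat\lambda \approx 0.536\) and \(t_0 \approx 15.6\):

```r
# raw counts from Table 8.1: number of COVID tests per student
x <- c(rep(0, 101), rep(1, 22), rep(2, 8), 3, 3, 4, 5, 5, 6, 10)
n <- length(x)                       # 138 observations
lambda_hat <- mean(x)                # lambda estimated by the sample mean

# null probabilities for categories 0, 1 and ">= 2" (sparse cells pooled)
p <- c(dpois(0:1, lambda_hat), ppois(1, lambda_hat, lower.tail = FALSE))
e <- n * p                           # expected counts, all >= 5 after pooling
y <- c(sum(x == 0), sum(x == 1), sum(x >= 2))

t0 <- sum((y - e)^2 / e)             # observed test statistic
df <- length(y) - 1 - 1              # k - 1 - q, with q = 1 estimated parameter
pchisq(t0, df = df, lower.tail = FALSE)  # p-value, roughly 1e-4
```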

9 Test of homogeneity

Test of homogeneity: Test whether the probability distributions of the categories are the same over the different populations.

9.1 Chi-squared test

Hypothesis: \(H_0:p_{1j}=p_{2j}\) for \(j=1,2,3\) vs \(H_1:\) \(p_{1j}\neq p_{2j}\) for at least one \(j\).

##      Female Male Non-binary
## [1,]   TRUE TRUE      FALSE
## [2,]   TRUE TRUE      FALSE

Assumptions: \(e_{ij}=\frac{y_iy_j}{n} \geq 5\).

Test statistic: \(T=\sum^r_{i=1}\sum^c_{j=1}\frac{(Y_{ij}-e_{ij})^2}{e_{ij}}\). Under \(H_0\), the degrees of freedom are \((r-1)(c-1)\) = 2, where r is the number of rows and c is the number of columns in the contingency table.

## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 11.635, df = 2, p-value = 0.002975

Observed test statistic: \(t_0=\sum^{2}_{i=1}\sum^{3}_{j=1}\frac{(y_{ij}-e_{ij})^2}{e_{ij}}\) = 11.63.

P-value: \(P(T \geq 11.63) = P(\chi^2_{2} \geq 11.63) = 0.003\)

Decision:

  • Since the p-value is greater than 0.05, we do not reject the null hypothesis: there is no significant difference in … between … and …
  • Since the p-value is smaller than 0.05, we reject the null hypothesis: there is a significant difference in … between … and …
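A sketch with a hypothetical 2 × 3 contingency table (rows = populations, columns = categories; the counts are made up):

```r
tab <- matrix(c(20, 30, 25,
                40, 25, 10),
              nrow = 2, byrow = TRUE)
res <- chisq.test(tab)   # same machinery as the test of independence
res$expected             # e_ij = (row total x column total) / n
res                      # df = (r - 1)(c - 1) = 2
```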

10 Test of independence

10.1 Chi-squared test

Hypothesis: \(H_0:p_{ij}=p_ip_j\) for \(i=1,2,...,r,j=1,2,...,c\) vs \(H_1:\) Not all equalities hold.

##      Female Male Non-binary
## [1,]   TRUE TRUE      FALSE
## [2,]   TRUE TRUE      FALSE

Assumptions: all expected counts are greater or equal to 5 (i.e.\(e_{ij}=\frac{y_iy_j}{n} \geq 5\))

## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 11.635, df = 2, p-value = 0.002975

Test statistic: \(T=\sum^r_{i=1}\sum^c_{j=1}\frac{(Y_{ij}-e_{ij})^2}{e_{ij}}\). Under \(H_0\), the degree of freedom is \((r-1)(c-1)=\) 2, where c is the number of columns and r is the number of rows in the contingency table.

Observed statistic: \(t_0=\sum^{2}_{i=1}\sum^{3}_{j=1}\frac{(y_{ij}-{y_iy_j/n})^2}{y_iy_j/n}=\) 11.63

P-value: \(P(T\geq 11.63) = P(\chi^2_{2}\geq 11.63) = 0.003\)

Decision:

  • Since the p-value is greater than 0.05, we do not reject the null hypothesis: there is no association between … and …
  • Since the p-value is smaller than 0.05, we reject the null hypothesis: there is an association between … and …

11 Test in small samples (cell counts < 5)

11.1 Fisher’s exact test - 2 by 2 table

Fisher’s exact test: consider all possible permutations of the 2 by 2 contingency table with the same marginal totals, then calculate how many of these are equal to or “more extreme” than what we observed. As such, Fisher’s exact test does not require the expected cell counts to be \(\geq 5\).

Drawbacks:

  • It assumes that row and column margins are fixed.
  • Need to use Monte Carlo for large contingency tables.

Hypothesis: \(H_0:\) there is no association between … and … vs \(H_1:\) there is an association between … and …

Assumptions: the row and column margins are fixed. Fisher’s exact test does not require the expected cell counts to be \(\geq 5\).
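fisher.test() implements this; a sketch with a hypothetical 2 × 2 table (the one-sided alternative = "greater", as in the output below, is only available for 2 × 2 tables):

```r
tab <- matrix(c(12,  2,
                 3, 13),
              nrow = 2, byrow = TRUE)     # hypothetical counts
fisher.test(tab, alternative = "greater") # exact one-sided test
```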

## 
##  Fisher's Exact Test for Count Data
## 
## data:  tab
## p-value = 0.00157
## alternative hypothesis: greater

The degrees of freedom are not calculated as they are not relevant to Fisher’s exact test.

Decision:

  • P-value = 0.0016 \(<0.05\), \(H_0\) is rejected therefore there is an association between … and …
  • P-value = 0.0016 \(>0.05\), \(H_0\) is not rejected therefore there is no association between … and …

11.2 Yates’ corrected chi-squared test

Yates’ correction: apply a continuity correction to the chi-squared test, using \(P(X\leq x) \approx P(Y\leq x+0.5)\) and \(P(X\geq x) \approx P(Y\geq x-0.5)\), where \(X\) is the discrete variable and \(Y\) its continuous approximation.

Hypothesis: \(H_0:\) there is no association between … and … vs \(H_1:\) there is an association between … and …

Assumption: Yates’ correction uses a continuity correction to approximate a discrete, integer-valued statistic by a continuous distribution; it does not restrict the cell counts.
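In R, chisq.test() applies Yates’ correction via the correct argument (and does so by default for 2 × 2 tables); a sketch with hypothetical counts:

```r
tab <- matrix(c(12,  2,
                 3, 13),
              nrow = 2, byrow = TRUE)  # hypothetical 2x2 counts
chisq.test(tab, correct = TRUE)        # Yates' continuity correction
```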

## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 11.635, df = 2, p-value = 0.002975

Test statistic: \(T=\sum^r_{i=1}\sum^c_{j=1}\frac{(|Y_{ij}-e_{ij}|-0.5)^2}{e_{ij}}\), which approximately follows a \(\chi^2_{(r-1)(c-1)}\) distribution under \(H_0\). The degree of freedom is \((r-1)(c-1)=\) 2, where r is the number of rows and c is the number of columns in the contingency table.

Observed test statistic: \(t_0=\sum^{2}_{i=1}\sum^{3}_{j=1}\frac{(|y_{ij}-e_{ij}|-0.5)^2}{e_{ij}}\) = 11.63.

P-value: \(P(T\geq 11.63) = P(\chi^2_{2}\geq 11.63) = 0.003\)

Decision:

  • As p-value \(<0.05\), \(H_0\) is rejected therefore there is an association between … and …
  • As p-value \(>0.05\), \(H_0\) is not rejected therefore there is no association between … and …

11.3 Monte Carlo simulation

Monte Carlo simulation: resample (i.e. randomly generate contingency tables with the same margins) many times and calculate the test statistic for each resample, building a sampling distribution of test statistics. The p-value is the proportion of resampled test statistics \(\geq\) the observed test statistic.

Hypothesis: \(H_0:\) there is no association between … and … vs \(H_1:\) there is an association between … and …

Assumptions: No assumptions are made about the underlying distribution of the population. The cell counts are also not restricted for Monte Carlo simulation.

To calculate the p-value, a Monte Carlo simulation is performed with 10000 simulated tables. Note that degrees of freedom are not reported, as they are not relevant to the Monte Carlo simulation.
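A sketch of the corresponding call (the table below is hypothetical; simulate.p.value resamples tables with the same margins):

```r
tab <- matrix(c(20, 30, 25,
                40, 25, 10),
              nrow = 2, byrow = TRUE)  # hypothetical counts
set.seed(1)
chisq.test(tab, simulate.p.value = TRUE, B = 10000)  # df reported as NA
```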

## 
##  Pearson's Chi-squared test with simulated p-value (based on 10000
##  replicates)
## 
## data:  tab
## X-squared = 11.635, df = NA, p-value = 0.002

Test statistics: the test statistic is calculated for each of the resamples by \(T=\sum^r_{i=1}\sum^c_{j=1}\frac{(Y_{ij}-e_{ij})^2}{e_{ij}}\).

Observed test statistic: \(t_0=\sum^r_{i=1}\sum^c_{j=1}\frac{(y_{ij}-e_{ij})^2}{e_{ij}}=\) 11.63

P-value: the proportion of simulated test statistics \(\geq\) 11.63, giving \(p\) = 0.002

Decision:

  • As p-value \(<0.05\), \(H_0\) is rejected therefore there is an association between … and …
  • As p-value \(>0.05\), \(H_0\) is not rejected therefore there is no association between … and …

12 T-test

12.1 One sample t-test


Figure 12.1: Distribution of data with blue line indicating the tested mean of …

The data don’t appear normally distributed if:

  • the values are bounded below at zero, suggesting right skew;
  • the points in the Q-Q plot are not close enough to the line to be considered ‘normal’.

From Figure 12.1, …

Hypothesis: \(H_0: \mu =130\) vs \(H_1: \mu \neq 130\) (or \(\mu>130\), \(\mu<130\) for a one-sided test)

Assumptions:

  • Each observation is chosen at random from a population.
  • Variables are independently and identically distributed and follow \(N(\mu,\sigma^2)\)
## 
##  One Sample t-test
## 
## data:  data$uni_work
## t = -75.688, df = 137, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 130
## 95 percent confidence interval:
##  25.68461 30.99655
## sample estimates:
## mean of x 
##  28.34058

Test statistic: \(T=\frac{\bar{X}-\mu_0}{S/\sqrt{n}}\). Under \(H_0\), the test statistic follows a t distribution with \(n-1=\) 137 degrees of freedom.

Observed test statistic: \(t_0=\frac{28.34-130}{15.78/\sqrt{138}}=-75.69\)

P-value: \(2P(t_{137}\leq -75.69)\approx 0\)

Decision:

  • The p-value is smaller than 0.05, so we reject the null hypothesis: the data are not consistent with the hypothesised mean.
  • The p-value is greater than 0.05, so we do not reject the null hypothesis: the data are consistent with the hypothesised mean.

12.2 Two-sample t-test

Two-sample t-test: test whether the population mean of two samples are different.

Welch two-sample t-test: does not assume equal population variances.

Table 12.1: Statistics of number of hours spent on exercising by different genders.

Gender       Mean   Median   SD    Variance   Counts
Female        3.4        3   2.9        8.4       46
Male          5.0        5   3.4       11.7       91
Non-binary    5.0        5    NA         NA        1

Figure 12.2: Distribution of …


Figure 12.3: Q-Q plots of data in different groups.

The data don’t appear normally distributed if:

  • the values are bounded below at zero, suggesting right skew;
  • the points in the Q-Q plots are not close enough to the line to be considered ‘normal’.

Hypothesis: \(H_0:\mu_x=\mu_y\) vs \(H_1:\mu_x >\mu_y\) or \(\mu_x < \mu_y\) or \(\mu_x \neq \mu_y\)

Assumptions:

  • Variables \(X,Y\) are identically and independently distributed, following \(N(\mu_X,\sigma^2)\) and \(N(\mu_Y,\sigma^2)\) respectively.
  • Observations are independent of each other.
  • The regular two-sample t-test assumes equal population variances; the Welch two-sample t-test does not.
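The two flavours differ only in the var.equal argument; a self-contained sketch with simulated stand-in samples (the means, SDs and sample sizes loosely echo Table 12.1, not the real data):

```r
set.seed(2)
x <- rnorm(46, mean = 3.4, sd = 2.9)  # stand-in for female exercise hours
y <- rnorm(91, mean = 5.0, sd = 3.4)  # stand-in for male exercise hours

t.test(x, y)                     # Welch test: var.equal = FALSE is the default
t.test(x, y, var.equal = TRUE)   # classical pooled-variance two-sample t-test
```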

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 0, df = 274, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7864101  0.7864101
## sample estimates:
## mean of x mean of y 
##  4.478261  4.478261

Test statistic: \(T=\frac{\bar{X}-\bar{Y}}{S_p \sqrt{\frac{1}{n_x}+\frac{1}{n_y}}}\), where \(S^2_p=\frac{(n_x-1)S^2_x+(n_y-1)S^2_y}{n_x+n_y-2}\). Under \(H_0\), the test statistic follows a t distribution with \(n_x+n_y-2=\) 274 degrees of freedom.

Observed test statistic: \(t_0=\frac{4.48 - 4.48}{3.32 \sqrt{\frac{1}{138}+\frac{1}{138}}}=0\), where \(s^2_p=\frac{(138-1)\times 3.32^2+(138-1)\times 3.32^2}{138+138-2}=3.32^2\)

P-value: two-sided \(2P(t_{274}\geq |0|)=1\); one-sided \(P(t_{274}\leq 0)=0.5\) or \(P(t_{274}\geq 0)=0.5\)

Decision:

  • As the p-value is greater than 0.05, we do not reject the null hypothesis: there is no evidence that the population means of the two samples differ.
  • As the p-value is smaller than 0.05, we reject the null hypothesis: the population means of the two samples differ.

12.3 Paired samples t-test

Paired samples t-test: the same subjects are measured twice (e.g. blood samples from individuals before and after they smoke a cigarette). The differences between the two measurements are then analysed with a one-sample t-test.

The data don’t appear normally distributed if: the values are bounded below at zero, or the points in the Q-Q plot are not close enough to the line to be considered ‘normal’.

Hypothesis:\(H_0:\mu_d=0\) vs \(H_1:\mu_d\neq 0\)

Assumptions: the differences between the two samples are independent and identically distributed, following \(N(\mu_d,\sigma^2)\).
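A sketch with hypothetical before/after measurements, showing that the paired test is exactly the one-sample t-test on the differences:

```r
before <- c(25, 25, 27, 44, 30, 67, 53, 53, 52)  # hypothetical measurements
after  <- c(27, 29, 37, 56, 46, 82, 57, 80, 61)

t.test(after, before, paired = TRUE)  # paired t-test on n = 9 pairs
t.test(after - before)                # identical test on the differences
```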

## 
##  Paired t-test
## 
## data:  after and before
## t = 2.6374, df = 8, p-value = 0.02984
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   1.10285 16.45271
## sample estimates:
## mean of the differences 
##                8.777778

Test statistic: \(T=\frac{\bar D}{S_d / \sqrt{n}}\). Under \(H_0\), the test statistic follows a t distribution with \(n-1=8\) degrees of freedom.

Observed test statistic: \(t_0=\frac{8.78}{9.98/ \sqrt{9}}=2.64\)

P-value: two-sided \(2P(t_{8} \geq |2.6374|)=0.0298\); one-sided \(P(t_{8} \geq 2.6374)=0.0149\) or \(P(t_{8} \leq 2.6374)=0.9851\)

Decision:

  • As the p-value is greater than 0.05, we do not reject the null hypothesis: there is no evidence that the population means of the two samples differ.
  • As the p-value is smaller than 0.05, we reject the null hypothesis: the population means of the two samples differ.

13 Sign test

Sign test: used to test \(H_0:\mu = \mu_0\), or paired data, when normality is not satisfied.

  • Drawback: it ignores all the information on magnitude and hence has low power.

  • If \(H_0\) is true, the probability \(p_+\) of getting a positive difference \(D_i=X_i-\mu_0\) is \(\frac{1}{2}\).
  • The sign test reduces to a binomial test of proportions.
  • The sign test is a non-parametric test: no assumption on the data distribution is made except symmetry.

13.1 Sign test for one-sample mean


Figure 13.1: Distribution of the data.

## 
##  Exact binomial test
## 
## data:  freq
## number of successes = 65, number of trials = 117, p-value = 0.1336
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.4753134 1.0000000
## sample estimates:
## probability of success 
##              0.5555556

Hypothesis: \(H_0:\mu=30\) vs \(H_1:\mu>30, \mu<30,\mu \neq 30\)

Assumptions: Observations are independently sampled from a symmetric distribution.

Test statistic: \(T=\text{number of }(D_i>0)\) where \(D_i=X_i-30\). Under \(H_0\), the test statistic follows a \(B(n,\frac{1}{2})\) distribution, where n is the number of non-zero differences.

Observed test statistic: \(t_0=\text{number of }(d_i>0)=65\) (out of \(n=117\) non-zero differences, matching the output above)

P-value:

  • \(H_1:\mu<\mu_0\) - \(P(T \leq 65)=0.9023\)
  • \(H_1:\mu>\mu_0\) - \(P(T \geq 65)=0.1336\)
  • \(H_1:\mu \neq \mu_0\) & \(t_0<\frac{n}{2}\) - \(2P(T \leq t_0)\)
  • \(H_1:\mu \neq \mu_0\) & \(t_0>\frac{n}{2}\) - \(2P(T \geq 65)=0.2672\)

Conclusion:

  • The p-value is smaller than 0.05, so we reject the null hypothesis: the mean of … is not equal to (or greater than / less than) 30.
  • The p-value is greater than 0.05, so we do not reject the null hypothesis: the data are consistent with a mean of 30.

13.2 Sign test for paired data

Sign test can be used to test differences between paired data when normality is not satisfied.

Hypothesis: \(H_0:p_+=\frac{1}{2}\) vs \(H_1:p_+>\frac{1}{2}\)

Assumptions: Differences \(D_i\) are independent.
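The output below corresponds to a call like the following (7 positive differences out of 9 non-zero differences):

```r
# sign test as an exact binomial test: T ~ B(9, 1/2) under H0
binom.test(x = 7, n = 9, p = 0.5, alternative = "greater")

# the same p-value by hand
1 - pbinom(6, size = 9, prob = 0.5)
```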

## 
##  Exact binomial test
## 
## data:  t0 and n
## number of successes = 7, number of trials = 9, p-value = 0.08984
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.4503584 1.0000000
## sample estimates:
## probability of success 
##              0.7777778

Test statistic: Let \(T\) be the number of positive differences out of the 9 non-zero differences. Under \(H_0\), \(T \sim B(9,\frac{1}{2})\).

Observed test statistic: We observed \(t_0=7\) positive differences in the sample.

P-value: probability of getting a test statistic as or more extreme than what we observed, \(P(T \geq 7)=1-P(T \leq 6)=1-pbinom(6,size=9,prob=\frac{1}{2})\approx 0.0898\)

Conclusion:

  • As the p-value is greater than 0.05, we do not reject the null hypothesis: there is no evidence that the population means of the two samples differ.

  • As the p-value is smaller than 0.05, we reject the null hypothesis: the population means of the two samples differ.

14 Wilcoxon signed-rank test - for one sample & paired data

14.1 Using wilcox.test


Figure 14.1: Distribution of the data.

Hypothesis: \(H_0:\mu=30\) vs \(H_1:\mu>30, \mu<30,\mu \neq 30\)

Assumptions: Observations are independently sampled from a symmetric distribution.
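A self-contained sketch of the call (simulated right-skewed stand-in data; the real analysis would pass the survey values):

```r
set.seed(3)
x <- rexp(100, rate = 1/30)  # stand-in sample, skewed so a t-test is doubtful

# signed-rank test of H0: mu = 30 against H1: mu > 30
wilcox.test(x, mu = 30, alternative = "greater")
```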

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  diff
## V = 2886.5, p-value = 0.9395
## alternative hypothesis: true location is greater than 0

Test statistic:

  • for one-sided: \(W^+=\sum_{i:D_i>0}R_i\)
  • for two-sided: \(W=\min(W^+,W^-)\)

Observed test statistic:

  • for one-sided: \(t_0=w^+=2886.5\)
  • for two-sided: \(t_0=min(w^+,w^-)=2886.5\)

P-value:

  • \(H_1:\mu<\mu_0\) - \(P(W^+ \leq 2886.5)\approx 1-0.9395=0.0605\)
  • \(H_1:\mu>\mu_0\) - \(P(W^+ \geq 2886.5)=0.9395\) (the alternative used in the output above)
  • \(H_1:\mu \neq \mu_0\) - \(2P(W^+ \leq 2886.5)\approx 0.121\)

Conclusion:

  • The p-value is smaller than 0.05, so we reject the null hypothesis: the mean of … is not equal to (or greater than / less than) 30.
  • The p-value is greater than 0.05, so we do not reject the null hypothesis: the data are consistent with a mean of 30.

14.2 Using normal approximation

For large enough \(n\), we can use the normal distribution to approximate the distribution of the signed-rank test statistic: \(W^+\sim N\!\left(\frac{n(n+1)}{4},\frac{n(n+1)(2n+1)}{24}\right)\)


Figure 14.2: Distribution of the data.

Hypothesis: \(H_0:\mu=30\) vs \(H_1:\mu>30, \mu<30,\mu \neq 30\)

Assumptions: Observations are independently sampled from a symmetric distribution.

## [1] -1.736364
## [1] -1.736364

Test statistic: \(W=\min(W^+,W^-)\) where \(W^+=\sum_{i:D_i>0}R_i\), \(W^-=\sum_{i:D_i<0}R_i\), \(D_i=X_i-30\) and \(R_i\) are the ranks of \(|D_1|,|D_2|,...,|D_n|\). Under \(H_0\), \(W^+\sim WSR(138)\), a symmetric distribution with mean \(E(W^+)=\frac{n(n+1)}{4}=4795.5\) and variance \(Var(W^+)=\frac{n(n+1)(2n+1)}{24}=221392.25\).

Observed test statistic: the test statistic is found by determining differences between observations and 30 \(D_i=X_i-\mu_0\), followed by assigning the signed ranks of \(D_i\). The sum of positive ranks (\(w^+\)) and the sum of negative ranks (\(w^-\)) are calculated.

  • We have a two-sided alternative, so the test statistic is \(w=min(w^+,w^-)=3978.5\).
  • We have a one-sided alternative, so the test statistic is \(w=w^+=3978.5\).

By using the normal approximation, the standardised test statistic is \(t_0=\frac{w-E(W^+)}{\sqrt{Var(W^+)}}=\frac{3978.5-4795.5}{\sqrt{221392.25}}=-1.74\)

P-value:

  • \(H_1:\mu<\mu_0\) - \(P(W^+ \leq 3978.5)\approx P\left(Z\leq\frac{3978.5-4795.5}{\sqrt{221392.25}}\right)=P(Z\leq -1.74)=0.0413\)
  • \(H_1:\mu>\mu_0\) - \(P(W^+ \geq 3978.5)\approx 1-P(Z\leq -1.74)=0.9587\)
  • \(H_1:\mu \neq \mu_0\) - \(2P(W^+ \leq 3978.5)\approx 2P(Z\leq -1.74)=0.0825\)
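The normal approximation above can be checked numerically (n = 138 and w⁺ = 3978.5 are the values used in this section):

```r
n      <- 138
w_plus <- 3978.5

ew <- n * (n + 1) / 4                 # E(W+) = 4795.5
vw <- n * (n + 1) * (2 * n + 1) / 24  # Var(W+) = 221392.25
z  <- (w_plus - ew) / sqrt(vw)        # standardised statistic, approx -1.74

pnorm(z)      # one-sided p-value for H1: mu < mu0
2 * pnorm(z)  # two-sided p-value
```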

Conclusion:

  • The p-value is smaller than 0.05, so we reject the null hypothesis: the mean of … is not equal to (or greater than / less than) 30.
  • The p-value is greater than 0.05, so we do not reject the null hypothesis: the data are consistent with a mean of 30.

15 Wilcoxon rank-sum test - for two samples